In this study, we explore the representation mapping from the domain of visual arts to the domain of music, with which we can use visual arts as an effective handle to control music generation. Unlike most studies in multimodal representation learning that are purely data-driven, we adopt an analysis-by-synthesis approach that combines deep music representation learning with user studies. Such an approach enables us to discover \textit{interpretable} representation mappings without a huge amount of paired data. In particular, we discover that the visual-to-music mapping has a nice property similar to equivariance. In other words, we can use various image transformations, such as changing brightness, changing contrast, or style transfer, to control the corresponding transformations in the music domain. In addition, we release the Vis2Mus system as a controllable interface for symbolic music generation.
In computer music and psychoacoustics, the relationship between perceived loudness and physical attributes is an important topic. Early studies on "equal-loudness contours" date back to the 1920s, and loudness measurements over intensity and frequency have been revised multiple times since. However, most studies focus only on synthesized sound, and there are few loudness theories for natural tones. To this end, we study the theory and application of natural-tone loudness perception in this paper by modeling piano tones. The theory part contains: 1) an accurate measurement of piano equal-loudness contours over pitch, and 2) a machine-learning model capable of inferring loudness purely from spectral features, trained on human-subject measurements. As for the application, we apply our theory to piano performance control transfer, in which we adjust the MIDI velocities on two different player pianos (in different acoustic environments) to achieve the same perceptual effect. Experiments show that both our theoretical loudness modeling and the corresponding performance control-transfer algorithm significantly outperform their baselines.
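As a minimal illustration of the control-transfer idea, one can match perceived loudness across two instruments by inverting a per-instrument velocity-to-loudness curve. The curves and names below are hypothetical stand-ins; the paper's model infers loudness from measured spectral features rather than a closed-form curve.

```python
# Sketch of performance control transfer via loudness matching.
# The two curves below are hypothetical monotone velocity-to-loudness
# mappings (the paper learns such mappings from human-subject data).

def loudness_piano_a(velocity):
    # Hypothetical loudness curve for piano A (arbitrary loudness units).
    return 0.12 * velocity ** 1.1

def loudness_piano_b(velocity):
    # Piano B in a different acoustic environment: a different curve.
    return 0.20 * velocity ** 1.0

def transfer_velocity(v_a, target_curve, lo=1, hi=127):
    """Find the MIDI velocity on the target piano whose modeled loudness
    best matches piano A's loudness at v_a (brute-force curve inversion)."""
    target = loudness_piano_a(v_a)
    return min(range(lo, hi + 1), key=lambda v: abs(target_curve(v) - target))

# A velocity-80 note on piano A mapped to an equally loud note on piano B.
v_b = transfer_velocity(80, loudness_piano_b)
```

With monotone curves, the brute-force search over the 127 MIDI velocities is exact up to integer quantization; a learned model would replace the closed-form curves but keep the same inversion step.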
Music contains hierarchical structure beyond beats and measures. Although hierarchical structure annotations are helpful for music information retrieval and computer musicology, such annotations are scarce in current digital music databases. In this paper, we explore a data-driven approach to automatically extract hierarchical metrical structure from scores. We propose a new model with a Temporal Convolutional Network-Conditional Random Field (TCN-CRF) architecture. Given a symbolic music score, our model takes in an arbitrary number of voices in a beat-quantized form and predicts a 4-level hierarchical metrical structure from downbeat level to section level. We also annotate a dataset using RWC-POP MIDI files to facilitate training and evaluation. We show by experiments that the proposed method outperforms rule-based approaches under different orchestration settings. We also perform some simple musicological analyses on the model predictions. All demos, datasets and pre-trained models are publicly available on GitHub.
Lyric interpretations can help people quickly understand songs and their lyrics, and can also make it easier to manage, retrieve and discover songs amid ever-growing music archives. In this paper, we propose BART-fusion, a novel model for generating lyric interpretations from lyrics and music audio that combines a large-scale pre-trained language model with an audio encoder. We employ a cross-modal attention module to incorporate the audio representation into the lyric representation, helping the pre-trained language model understand the song from an audio perspective while preserving the language model's original generative performance. We also release the Song Interpretation Dataset, a new large-scale dataset for training and evaluating our model. Experimental results show that the additional audio information helps our model understand words and music better, and generate precise and fluent interpretations. An additional experiment on cross-modal music retrieval shows that the interpretations generated by BART-fusion can also help people retrieve music more accurately than with the original BART.
Can we automatically derive the score of a piano accompaniment from the audio of a pop song? This is the audio-to-symbolic arrangement problem we address in this paper. A good arrangement model should not only consider the audio content but also have prior knowledge of piano composition, so that the generation both "sounds like" the audio and maintains musicality. To this end, we contribute a cross-modal representation-learning model, which 1) extracts chord and melodic information from the audio, and 2) learns texture representations from both the audio and a corrupted ground-truth arrangement. We further introduce a tailored training strategy that gradually shifts the source of texture information from the corrupted score to the audio. In the end, the score-based texture posterior is reduced to a standard normal distribution, and only the audio is needed for inference. Experiments show that our model captures the major audio information and outperforms baselines in generation quality.
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
Recent studies have shown that using an external Language Model (LM) benefits end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. Long-tail prediction problems have been widely studied in many applications, but have only been addressed by a few studies for ASR and LMs. In this paper, we propose a new memory-augmented, lookup-dictionary-based Transformer architecture for LMs. The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital to correctly predicting long-tail tokens. Through extensive experiments on Chinese and English data sets, our proposed method is shown to outperform the baseline Transformer LM by a large margin on both word/character error rate and tail-token error rate, without any impact on decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting ASR decoding performance, especially for long-tail tokens.
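The core intuition, retrieving next-token statistics for a context from the training set and fusing them with the LM's own distribution, can be sketched with a toy bigram dictionary. Everything here is illustrative (the uniform "base LM", the interpolation weight, the bigram key); the paper integrates the dictionary inside a Transformer rather than interpolating externally.

```python
from collections import Counter, defaultdict

# Toy lookup dictionary: previous token -> counts of observed next tokens.
train_tokens = "the tail token zyzzyva follows the word rare zyzzyva".split()
lookup = defaultdict(Counter)
for prev, nxt in zip(train_tokens, train_tokens[1:]):
    lookup[prev][nxt] += 1

def base_lm_prob(prev, token, vocab):
    # Stand-in for a neural LM: a uniform distribution over the vocabulary.
    return 1.0 / len(vocab)

def fused_prob(prev, token, vocab, lam=0.5):
    """Interpolate the base LM probability with the lookup distribution,
    which carries contextual statistics even for rare (tail) tokens."""
    counts = lookup.get(prev, Counter())
    total = sum(counts.values())
    p_lookup = counts[token] / total if total else 0.0
    return lam * p_lookup + (1 - lam) * base_lm_prob(prev, token, vocab)

vocab = set(train_tokens)
# The tail token "zyzzyva" gets boosted after its training-set context "rare".
```

The point of the sketch is only the fusion mechanics: a tail token that is near-invisible to the base LM still receives probability mass whenever its training-set context recurs at decoding time.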
It is crucial to evaluate the quality of clustering results and to determine the optimal number of clusters in cluster analysis. In this paper, a multi-granularity characterization of the data set is carried out to obtain hyper-balls, and a cluster internal evaluation index based on hyper-balls (HCVI) is defined. Moreover, a general method for determining the optimal number of clusters based on HCVI is proposed. The proposed methods can evaluate the clustering results produced by several classic methods and determine the optimal cluster number for data sets containing noise and clusters with arbitrary shapes. Experimental results on synthetic and real data sets indicate that the new index outperforms existing ones.
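The general recipe of picking a cluster count by sweeping k and scoring each partition with an internal validity index can be sketched as follows. The index below is a simple silhouette-style score and the clustering is a bare-bones 1-D k-means, both stand-ins: the paper's HCVI is built on hyper-balls, not on silhouettes.

```python
import random

def kmeans_1d(xs, k, iters=50, seed=0):
    """Bare-bones Lloyd's algorithm on 1-D data; returns the k groups."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        # Keep the old center if a group ever empties out.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

def silhouette(groups):
    """Simple internal index: cohesion vs. separation, averaged over points."""
    score, n = 0.0, 0
    for gi, g in enumerate(groups):
        for x in g:
            a = sum(abs(x - y) for y in g) / max(len(g) - 1, 1)
            b = min(sum(abs(x - y) for y in h) / len(h)
                    for hi, h in enumerate(groups) if hi != gi and h)
            score += (b - a) / max(a, b)
            n += 1
    return score / n

data = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.8, 9.9, 10.0]
# Sweep candidate cluster counts and keep the best-scoring one.
best_k = max(range(2, 5), key=lambda k: silhouette(kmeans_1d(data, k)))
```

Swapping in HCVI would only change the scoring function; the sweep-and-argmax structure for determining the optimal cluster number is the same.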
Generalizability to unseen forgery types is crucial for face forgery detectors. Recent works have made significant progress on generalization through synthetic-forgery data augmentation. In this work, we explore another path to improving generalization. Our goal is to suppress the features that are easy to learn in the training phase, so as to reduce the risk of overfitting to specific forgery types. Specifically, in our method, a teacher network takes the face images as input and generates an attention map over the deep features using a diverse multihead-attention ViT. The attention map is used to guide a student network to focus on the low-attended features by suppressing the highly-attended deep features. A deep feature mixup strategy is also proposed to synthesize forgeries in the feature domain. Experiments demonstrate that, without data augmentation, our method achieves promising performance on unseen forgeries and highly compressed data.
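The feature-domain mixup idea amounts to a convex combination of two feature vectors. The sketch below shows that arithmetic on plain Python lists; in the paper this would be applied to deep features inside the detector, and the function name and mixing scheme here are assumptions for illustration.

```python
import random

def feature_mixup(feat_real, feat_fake, lam=None, rng=random.Random(0)):
    """Convexly combine a real-face feature with a forged-face feature to
    synthesize a new forgery-like feature; lam is drawn uniformly if absent."""
    if lam is None:
        lam = rng.uniform(0.0, 1.0)
    mixed = [lam * r + (1 - lam) * f for r, f in zip(feat_real, feat_fake)]
    return mixed, lam

# Toy 3-dimensional features for demonstration.
real = [0.0, 1.0, 2.0]
fake = [4.0, 3.0, 2.0]
mixed, lam = feature_mixup(real, fake, lam=0.25)
# mixed = [3.0, 2.5, 2.0]
```

Because the mix happens in feature space rather than pixel space, the synthesized sample need not correspond to any renderable image, which is what makes it a cheap source of novel forgery variation during training.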
This paper presents a novel framework for planning in unknown and occluded urban spaces. We specifically focus on turns and intersections, where occlusions significantly impact navigability. Our approach uses an inpainting model to fill in a sparse, occluded, semantic lidar point cloud and plans dynamically feasible paths for a vehicle to traverse through the open and inpainted spaces. We demonstrate our approach on a car's lidar data with real-time occlusions, and show that by inpainting occluded areas we can plan longer paths with more turn options than without inpainting; in addition, our approach more closely follows paths derived from an occlusion-free planner (the ground truth) than other state-of-the-art approaches.